To get started, we first prepare our R environment and load the packages we will need later on. The data used in this practical are stored on Zenodo, so we also set the base path for the downloads.
library("tidyverse") # for general data wrangling and plotting
library("SIAMCAT") # for statistical and ML analyses
data.loc <- 'https://zenodo.org/api/files/d81e429c-870f-44e0-a44a-2a4aa541b6c1/'
In this practical, we are going to have a look at the data from Zeller et al. MSB 2014. In this study, the authors recruited patients with colorectal cancer (CRC) and healthy controls (CTR) and performed shotgun metagenomic sequencing of fecal samples. The raw data have already been pre-processed and analyzed with the mOTUs taxonomic profiler.
First, we are going to load the taxonomic profiles and store them as a matrix.
fn.feat.fr <- paste0(data.loc, 'specI_Zeller.tsv')
feat.fr <- read.table(fn.feat.fr, sep='\t', quote="",
check.names = FALSE, stringsAsFactors = FALSE)
feat.fr <- as.matrix(feat.fr)
Additionally, we need to know which sample belongs to which group, so we load the metadata table as well:
fn.meta.fr <- paste0(data.loc, 'meta_Zeller.tsv')
df.meta <- read.table(fn.meta.fr)
df.meta
table(df.meta$Group)
##
## CRC CTR
## 53 88
First, we can have a look at the library size across samples:
options(repr.plot.width=5, repr.plot.height=5)
hist(colSums(feat.fr), 30, col='slategray',
main='Library Size Histogram', xlab='Library Size')
The plot above strongly suggests that we should correct for differences in library size. Here we use the simplest approach: conversion to relative abundances (also known as total sum scaling). Alternatives include rarefying (i.e. downsampling) and DESeq’s library size normalization.
feat.fr.rel <- prop.table(feat.fr, 2)
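As a point of comparison, the rarefying alternative mentioned above can be sketched in base R. The `rarefy_counts` helper below is a hypothetical illustration written for this practical, not a function from any of the loaded packages:

```r
# Hypothetical helper (not from tidyverse/SIAMCAT): downsample every sample
# (column) of a count matrix to the same sequencing depth, so that library
# sizes become equal instead of being divided out as in prop.table().
rarefy_counts <- function(counts, depth = min(colSums(counts))) {
  apply(counts, 2, function(x) {
    # draw `depth` reads without replacement, proportional to observed counts
    reads <- sample(rep.int(seq_along(x), x), size = depth)
    tabulate(reads, nbins = length(x))
  })
}

# toy example with two samples of unequal depth (15 vs 30 reads)
toy <- matrix(c(10L, 0L, 5L, 2L, 8L, 20L), nrow = 3,
              dimnames = list(paste0('sp', 1:3), c('s1', 's2')))
colSums(rarefy_counts(toy))  # both samples end up at the minimum depth, 15
```

On real data this would be applied to the raw counts before any conversion to relative abundances.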
Next, we can have a look at the prevalence of the bacterial features across all samples. The assumption is that features present in only a handful of samples are unlikely to play a major role in the gut ecosystem, and their quantification carries the greatest uncertainty.
options(repr.plot.width=8, repr.plot.height=5)
hist(rowMeans(feat.fr.rel != 0), 100, col='slategray',
xlab='Prevalence of bacterial species', main='Prevalence histogram')
Most of the features are present in only a few of the samples (or in none at all), so we can discard them.
f.idx <- names(which(rowMeans(feat.fr.rel != 0) > 0.05))
# remove the Unmapped part as well
f.idx <- setdiff(f.idx, 'UNMAPPED')
feat.fr.rel.filt <- feat.fr.rel[f.idx,]
How does the abundance vary across these different bacterial features?
# resize plots in Jupyter Notebook to more convenient dimensions
options(repr.plot.width=8, repr.plot.height=5)
# rank-abundance plot
rnk <- order(apply(feat.fr.rel.filt, 1, median), decreasing=TRUE)
boxplot(log10(t(feat.fr.rel.filt[rnk,]) + 1E-6), las=2, cex=0.3, pch=16, lty=1,
col='slategray',
xlab='Species rank', ylab='Relative abundance (log10)', xaxt='n')
Let’s take two examples and visualize them. The top feature is Bacteroides dorei/Bacteroides vulgatus:
# resize plots in Jupyter Notebook to more convenient dimensions
options(repr.plot.width=5, repr.plot.height=5)
# one of the most abundant species
hist(log10(feat.fr.rel.filt[40,] + 1E-6), 20,
col='slategray', main='Histogram (B. vulgatus)',
xlab='log10(Relative Abundance)')
A bacterium with a more bi-modal distribution is Prevotella copri:
# resize plots in Jupyter Notebook to more convenient dimensions
options(repr.plot.width=5, repr.plot.height=5)
# one of the most abundant species
hist(log10(feat.fr.rel.filt[20,] + 1E-6), 20,
col='slategray', main='Histogram (P. copri)',
xlab='log10(Relative Abundance)')
Now that we have set up everything, we can test all microbial species
for statistically significant differences between the CTR
and CRC groups. In order to do so, we perform a Wilcoxon
test on each individual bacterial species. Let’s start with one of the
species:
x <- feat.fr.rel.filt[40,]
y <- df.meta$Group
wilcox.test(x~y)
##
## Wilcoxon rank sum test with continuity correction
##
## data: x by y
## W = 2436, p-value = 0.6595
## alternative hypothesis: true location shift is not equal to 0
The result seems to be that there is no significant difference in the relative abundance of Bacteroides dorei/vulgatus between these two groups. We can also visualize this with a boxplot:
boxplot(log10(x+1E-06)~y, xlab='', ylab='log10(Relative Abundance)',
main='B. dorei difference', col=c('firebrick', 'slategrey'))
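One property worth noting (a general fact about rank tests, not specific to this dataset): the Wilcoxon test uses only the ranks of the values, so a monotone transform such as the log used for plotting leaves the P-value unchanged. A small simulated illustration:

```r
# Simulated data: the Wilcoxon test depends only on ranks, so the P-value is
# identical for raw and log-transformed abundances.
set.seed(7)
abund <- c(rexp(20), rexp(20) * 2)           # two groups, second shifted up
group <- rep(c('CTR', 'CRC'), each = 20)
p.raw <- wilcox.test(abund ~ group)$p.value
p.log <- wilcox.test(log10(abund + 1e-6) ~ group)$p.value
stopifnot(all.equal(p.raw, p.log))           # same P-value either way
```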
Now, we can just run this test for all individual bacteria:
p.vals <- rep_len(1, nrow(feat.fr.rel.filt))
names(p.vals) <- rownames(feat.fr.rel.filt)
stopifnot(all(rownames(df.meta) == colnames(feat.fr.rel.filt)))
for (i in rownames(feat.fr.rel.filt)){
x <- feat.fr.rel.filt[i,]
y <- df.meta$Group
t <- wilcox.test(x~y)
p.vals[i] <- t$p.value
}
head(sort(p.vals))
## unclassified Fusobacterium [Cluster1482]
## 3.370035e-10
## unclassified Fusobacterium [Cluster1481]
## 2.042310e-08
## Pseudoflavonifractor capillosus [Cluster1579]
## 2.749302e-06
## Fusobacterium nucleatum [Cluster1479]
## 1.316326e-05
## Porphyromonas asaccharolytica [Cluster1056]
## 1.914120e-05
## Prevotella nigrescens [Cluster1069]
## 5.114510e-05
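Since we test several hundred species at once, the raw P-values should be corrected for multiple testing before interpretation. Base R’s p.adjust with the Benjamini-Hochberg method controls the false discovery rate; a minimal sketch on a toy vector (on the real data, the same call would simply take the p.vals vector computed above):

```r
# Benjamini-Hochberg FDR correction with base R's p.adjust;
# the named toy vector stands in for the p.vals computed above.
p.toy <- c(spA = 3.4e-10, spB = 2.7e-06, spC = 0.03, spD = 0.66)
p.adj <- p.adjust(p.toy, method = 'BH')
p.adj    # adjusted values are >= the raw ones, capped at 1
```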
The species with the most significant effect seems to be a Fusobacterium species, so let us take a look at the distribution of this species:
x <- feat.fr.rel.filt['unclassified Fusobacterium [Cluster1482]',]
boxplot(log10(x+1E-06)~y, xlab='', ylab='log10(Relative Abundance)',
main='Fusobacterium difference', col=c('firebrick', 'slategrey'))
Today, we will use the SIAMCAT package to train machine learning models on microbiome data. All the data are stored in the SIAMCAT object, which contains the feature matrix, the metadata, and information about the groups you want to compare.
sc.obj <- siamcat(feat=feat.fr.rel, meta=df.meta,
label='Group', case='CRC')
## + starting create.label
## Label used as case:
## CRC
## Label used as control:
## CTR
## + finished create.label.from.metadata in 0.015 s
## + starting validate.data
## +++ checking overlap between labels and features
## + Keeping labels of 141 sample(s).
## +++ checking sample number per class
## +++ checking overlap between samples and metadata
## + finished validate.data in 0.025 s
We can use SIAMCAT for feature filtering as well:
sc.obj <- filter.features(sc.obj, filter.method = 'prevalence', cutoff = 0.05)
## Features successfully filtered
sc.obj
## siamcat-class object
## label() Label object: 88 CTR and 53 CRC samples
## filt_feat() Filtered features: 358 features after prevalence filtering
##
## contains phyloseq-class experiment-level object @phyloseq:
## phyloseq@otu_table() OTU Table: [ 1754 taxa and 141 samples ]
## phyloseq@sam_data() Sample Data: [ 141 samples by 5 sample variables ]
SIAMCAT offers a few normalization approaches that can
be useful for subsequent statistical modeling in the sense that they
transform features in a way that can increase the accuracy of the
resulting models. Importantly, these normalization techniques do not
make use of any label information (patient status), and can thus be
applied up front to the whole data set (and outside of the following
cross validation).
sc.obj <- normalize.features(sc.obj, norm.method = 'log.std',
norm.param = list(log.n0=1e-06, sd.min.q=0))
## Features normalized successfully.
sc.obj
## siamcat-class object
## label() Label object: 88 CTR and 53 CRC samples
## filt_feat() Filtered features: 358 features after prevalence filtering
## norm_feat() Normalized features: 358 features normalized using log.std
##
## contains phyloseq-class experiment-level object @phyloseq:
## phyloseq@otu_table() OTU Table: [ 1754 taxa and 141 samples ]
## phyloseq@sam_data() Sample Data: [ 141 samples by 5 sample variables ]
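To make the 'log.std' choice above more concrete, here is a base-R sketch of what this kind of normalization amounts to, assuming the behavior described in the SIAMCAT documentation (log10 with pseudocount log.n0, then z-scoring each feature). This is an illustration, not SIAMCAT’s actual code:

```r
# Sketch of log.std-style normalization (assumed behavior, not SIAMCAT code):
# log10 with a pseudocount, then center/scale each feature (row) to unit sd.
log_std <- function(rel_abund, log.n0 = 1e-6) {
  logged <- log10(rel_abund + log.n0)
  t(scale(t(logged)))  # scale() works column-wise, so transpose twice
}

set.seed(1)
toy <- matrix(runif(12), nrow = 3)   # 3 features x 4 samples
norm <- log_std(toy)
rowMeans(norm)                        # ~0 for every feature
```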
Cross validation is a technique to assess how well an ML model would generalize to external data by partitioning the dataset into training and test sets. Here, we split the dataset into 10 parts, train a model on 9 of these parts, and use the left-out part to test the model. The whole process is repeated 10 times.
sc.obj <- create.data.split(sc.obj, num.folds = 10, num.resample = 10)
## Features splitted for cross-validation successfully.
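Conceptually, what create.data.split does can be sketched in a few lines of base R: assign each sample a fold id, roughly balanced across the 10 folds. SIAMCAT additionally stratifies by label and repeats the split for each resampling round; this toy version skips that:

```r
# Toy illustration of a 10-fold split over 141 samples (no stratification)
n.samples <- 141
num.folds <- 10
set.seed(42)
fold.id <- sample(rep_len(seq_len(num.folds), n.samples))
table(fold.id)   # each fold holds 14 or 15 samples
```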
Now, we can train a LASSO logistic regression classifier in order to distinguish CRC cases and controls.
sc.obj <- train.model(sc.obj, method='lasso')
## Trained lasso models successfully.
This function will automatically apply the models trained in cross validation to their respective test sets and aggregate the predictions across the whole data set.
sc.obj <- make.predictions(sc.obj)
## Made predictions successfully.
Calling the evaluate.predictions function assesses precision and recall and also performs a ROC analysis; both can be plotted as a PDF file using the model.evaluation.plot function (the name of/path to the PDF file is passed as an argument).
sc.obj <- evaluate.predictions(sc.obj)
## Evaluated predictions successfully.
model.evaluation.plot(sc.obj, fn.plot = './figures/eval_plot.pdf')
## Plotted evaluation of predictions successfully to: ./figures/eval_plot.pdf
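The ROC analysis above summarizes the predictions as an AUROC. As background (a standard rank-based identity, not SIAMCAT’s implementation), the AUROC equals the probability that a randomly chosen case receives a higher prediction score than a randomly chosen control, which can be computed directly from ranks:

```r
# Rank-based AUROC (Mann-Whitney statistic scaled to [0, 1]);
# labels: TRUE = case, FALSE = control
auroc <- function(scores, labels) {
  r <- rank(scores)
  n1 <- sum(labels)
  n0 <- sum(!labels)
  (sum(r[labels]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auroc(c(3, 4, 5, 0, 1, 2),
      c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE))  # perfect separation: 1
```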
## Model Interpretation
Finally, the model.interpretation.plot function plots characteristics of the models (i.e. model coefficients or feature importance) alongside the input data, aiding in understanding how and why the model works (or not).
model.interpretation.plot(sc.obj, consens.thres = 0.7,
fn.plot = './figures/interpretation_plot.pdf')
## Successfully plotted model interpretation plot to: ./figures/interpretation_plot.pdf